Untangling influences of hydrophobicity on protein sequences and structures 
We fit the Fourier transforms of solvent accessibility and hydrophobicity profiles of a representative
set of proteins to a joint multivariate Gaussian. This allows us to separate the intrinsic tendencies
of sequence and structure profiles from the interactions that correlate them; for example, the
α-helix periodicity in sequence hydrophobicity is dictated by the solvent accessibility of structures.
The distinct intrinsic tendencies of sequence and structure profiles are most pronounced at long
periods, where sequence hydrophobicity fluctuates more, while solvent accessibility fluctuations are
less than average. Interestingly, correlations between the two profiles can be interpreted as the
Boltzmann weight of the solvation energy at room temperature.



How the sequence of amino acids determines the structure
and function of the folded protein remains a challenging
problem. It is known that hydrophobicity is an important
determinant of the folded state; hydrophobic monomers
tend to be in the core, and polar monomers on the
surface [1], [2], [3], [4]. Several studies have examined
the correlations in the hydrophobicity of amino acids
along the protein chain [5], [6], [7], [8], [9], which are
used in secondary structure prediction [2], and in the
design of good folding sequences [10]. Naturally, sequence
correlations arise from locations of the amino acids in
the folded protein structure, and are best interpreted in
conjunction with solvent accessibility profiles (which
indicate how exposed a particular amino acid is to water
in a specific structure). For example, Eisenberg et al. [2]
note that for secondary structures lying on the protein
surface, which have a strong periodicity in their solvent
accessibility, hydrophobicity profiles also exhibit the
period of the corresponding α helix or β strand.
Constraints from forming compact structures induce strong
correlations in the solvent accessibility profile
[9], [11], [12], [13], which should in turn induce similar
correlations in the hydrophobicity profiles. It is
desirable to quantify and separate the resulting
correlations in protein sequences and structures.

In this paper, we aim for a unified treatment of
hydrophobicity and solvent accessibility profiles, and the
interactions between them. The sequence of each protein is
represented by a profile {h_i}, where h_i is a standard
measure of the hydrophobicity of the i-th amino acid along
the backbone [14]. Its structure has a profile {s_i} for
i = 1, 2, ..., N, where s_i is a measure of the exposure
of the amino acid to water in the folded structure [2].
While we do not expect perfect correlations between these
profiles, we can inquire about the statistical nature of
these correlations, and in particular whether they are
diminished or enhanced at different periods. To this end,
we employ the method of Fourier transforms, and examine
the statistics of the resulting amplitudes {h_q, s_q}, and
power spectra {|h_q|^2, |s_q|^2}, for a database of 1461
non-homologous proteins. In a sense, this can be regarded
as extending the work of Eisenberg et al. [3], who explore
correlations between hydrophobicity and solvent
accessibility, to an analysis that is independent of
specific locations along the backbone. Of course, the use
of Fourier analysis is by no means new, and has for
example been employed to study hydrophobicity profiles
[3], [5], [15], [16]. However, we are not aware of its use
as a means of correlating sequence and structure profiles.

Our results suggest that the hydrophobicity and solvent
accessibility profiles are well approximated by a joint
Gaussian probability distribution. This allows us to
obtain the intrinsic correlations in the hydrophobicity
profile, as distinct from correlations induced by solvent
accessibility. For example, the α-helix periodicity in
hydrophobicity profiles is shown to be induced by the
corresponding periodicity in the solvent accessibility
profiles. We also find that at long wavelengths the two
profiles have different intrinsic characteristics: solvent
accessibility profiles are positively correlated while
hydrophobicity profiles are anti-correlated.
Interestingly, the coupling between the two profiles is
independent of wave-number, and hence can be interpreted
as the Boltzmann weight of the solvation energy. The
corresponding temperature is close to room temperature,
consistent with the "mean" temperature estimated in
previous work from the frequencies of occurrence of amino
acid residues in the core and on the surface [17], [18].

For our protein data set, we selected 2200 representative
chains from the Dali/FSSP database. Any two protein chains
in this set have more than 25 percent structural
dissimilarity. We removed all the multi-domain chains by
using the CATH domain definition database, leaving 1461
protein chains [19], [20]. The hydrophobicity profiles,
{h_i}, were generated from the sequence of amino acids
using the experimentally measured scale of Fauchère and
Pliska [21] (in units of kcal/mol). We used the relative
solvent accessibility reported by NACCESS [22] to generate
solvent accessibility profiles {s_i}.
(The relative solvent accessibility is the ratio of the
solvent accessibility of a residue to the solvent
accessibility of that residue in an extended tripeptide
ALA-X-ALA for each amino acid type X.) We then computed
the corresponding Fourier components as

s_q = \frac{1}{\sqrt{N}} \sum_{j=1}^{N} e^{iqj} (s_j - \bar{s}),   (1)

where \bar{s} is the average of {s_j}, q = 2πα/N with
α = 0, 1, ..., N - 1, and similarly for h_q. (The average
values were subtracted to remove the DC component in the
Fourier transform.)
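
To make the transform concrete, the following is a minimal sketch in plain Python; it is not the authors' code. It assumes a unitary 1/sqrt(N) normalization and site indices running from 1 to N, both of which are conventions chosen here for illustration, as are the function name `fourier_components` and the toy profile.

```python
import cmath
import math


def fourier_components(profile):
    """Return a list of (q, amplitude) pairs for a mean-subtracted profile.

    Assumed convention (an illustration, not taken from the paper):
    x_q = N**-0.5 * sum_j exp(i*q*j) * (x_j - mean), with j = 1..N,
    q = 2*pi*alpha/N and alpha = 0..N-1.
    """
    n = len(profile)
    mean = sum(profile) / n
    centered = [x - mean for x in profile]
    components = []
    for alpha in range(n):
        q = 2.0 * math.pi * alpha / n
        total = sum(c * cmath.exp(1j * q * site)
                    for site, c in enumerate(centered, start=1))
        components.append((q, total / math.sqrt(n)))
    return components


# Toy profile (made-up numbers): an offset cosine with a period of 4 residues.
toy_profile = [0.5 + 0.3 * math.cos(2.0 * math.pi * site / 4.0)
               for site in range(1, 17)]
components = fourier_components(toy_profile)
power = [abs(amplitude) ** 2 for _, amplitude in components]

# Search 0 < q <= pi only; the spectrum of a real profile is symmetric about pi.
half = range(1, len(toy_profile) // 2 + 1)
peak_alpha = max(half, key=lambda a: power[a])
peak_period = 2.0 * math.pi / components[peak_alpha][0]  # period = 2*pi/q
print(f"DC power: {power[0]:.2e}, peak period: {peak_period:.2f}")
```

Subtracting the mean removes the q = 0 component, and the single peak of the toy spectrum sits at the period built into the profile. Averaging such power spectra over many proteins of different lengths would additionally require binning in q, which this sketch does not attempt.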
FIG. 1: Power spectra from averaging over 1461 proteins, for
(a) solvent accessibility (there are no units for |s_q|^2, since it
is based on accessibility relative to solution); (b) hydrophobicity
(the units of |h_q|^2 are (kcal/mol)^2). The plus signs in each
case are obtained from random permutation of the sequences.

Our results for the power spectra of solvent accessibility
and hydrophobicity profiles are indicated respectively in
Figs. 1(a) and 1(b) (q is related to the periodicity λ
through λ = 2π/q). A prominent feature of both plots is
the peak at the α-helix periodicity λ = 3.6 [3]. Its
presence in the solvent accessibility spectrum indicates
that solvation energy plays a role in the spatial
arrangement of α helices: when they lie on the surface,
the hydrophobic monomers are more likely to be exposed to
the solvent.

We would like to untangle correlations between the two
profiles, so as to determine their intrinsic tendencies,
by finding a joint probability distribution
P({s_i}, {h_i}). Clearly this cannot be decomposed as a
product of contributions from different sites i, as
neighboring components, such as s_i and s_{i+1}, are
highly correlated. We anticipate that the Fourier
components for different q are independently distributed
(i.e. P({h_q, s_q}) = ∏_q p(h_q, s_q)) for the following
reasons: (i) For the subgroup of cyclic proteins [24] the
index i is arbitrary, and the counting can start from any
site. The invariance under relabeling then implies that
the probability can only depend on i - j, and hence is
separable into independent Fourier components. This exact
result does not hold for open proteins because of end
effects, but should be approximately valid for long
sequences when such effects are small. (ii) Numerical
analysis of a lattice model of proteins in Ref. [12]
confirms the exact decomposition into Fourier modes for
cyclic structures, and its robustness even for open
structures of only N = 36 monomers. To test this
hypothesis, we examine all possible covariances involving
{h_q, s_q} for different q. Note that the Fourier
amplitudes are complex (i.e. s_q = Re s_q + i Im s_q, and
similarly for h_q), and hence there are 4 x 4 covariance
plots, such as in Fig. 2(a) for the covariance of Re s_q
with itself. In all cases we find that the off-diagonal
terms are small; the only exceptions are at small q where
we expect end effects to be most pronounced.
FIG. 2: (a) Covariance of Re s_q with Re s_q'. There is very
little correlation between off-diagonal terms. (b) Scatter plot
of Re h_q versus Re s_q for q = 0.90, and the half-width
half-maximum locus of a Gaussian fit (solid line).
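
The independence test across different q can be illustrated on synthetic data. The sketch below is not the authors' procedure: it builds hypothetical cyclic profiles whose neighboring sites are correlated (a circular moving average of Gaussian noise, chosen purely for illustration), and compares the correlation between two neighboring sites with the correlation between the real parts of the Fourier amplitudes at two different values of q. All names and parameters are assumptions.

```python
import cmath
import math
import random


def amplitude(profile, alpha):
    """Fourier amplitude at q = 2*pi*alpha/N (mean subtracted, 1/sqrt(N) scaling)."""
    n = len(profile)
    mean = sum(profile) / n
    q = 2.0 * math.pi * alpha / n
    total = sum((x - mean) * cmath.exp(1j * q * site)
                for site, x in enumerate(profile, start=1))
    return total / math.sqrt(n)


def correlation(xs, ys):
    """Pearson correlation coefficient of two equally long samples."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def cyclic_profile(n, rng):
    """Synthetic cyclic profile whose neighboring sites are correlated."""
    noise = [rng.gauss(0.0, 1.0) for _ in range(n)]
    # noise[i - 1] wraps around at i = 0, so the statistics depend only on i - j.
    return [noise[i] + 0.8 * noise[i - 1] for i in range(n)]


rng = random.Random(0)  # fixed seed for reproducibility
ensemble = [cyclic_profile(16, rng) for _ in range(2000)]

site_corr = correlation([p[0] for p in ensemble], [p[1] for p in ensemble])
mode_corr = correlation([amplitude(p, 2).real for p in ensemble],
                        [amplitude(p, 5).real for p in ensemble])
print(f"neighboring sites: {site_corr:+.2f}, different q: {mode_corr:+.2f}")
```

For such translation-invariant ensembles the site-to-site correlation is substantial while the correlation between different Fourier modes is consistent with zero, which is the behavior argued for in point (i). Real, open chains would show deviations at small q from end effects.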



One can make a similar case for the independence of the
real and imaginary components at a given q. (For cyclic
structures the phase is arbitrary.) The real (imaginary)
components of h_q and s_q are, however, correlated with
each other, as illustrated for q = 0.9 in Fig. 2(b) by the
scatter plot of (Re h_q, Re s_q). We made similar scatter
plots for different values of q in the interval 0 to π,
with similar results which were well fitted to Gaussian
forms. Based on these results, we describe the joint
probability distribution in Fourier space by the
multivariate Gaussian form

P({h_q, s_q}) = \prod_q \exp[ ... ],   (2)

with the parameters {A_q, B_q, C_q} plotted in Fig. 3. If
the probabilities depend only on the separation i - j
between sites, the real and imaginary Fourier amplitudes
should follow the same distribution. In our fits we
allowed the corresponding parameters to be different to
obtain a measure of the accuracy of the model and the
fitting procedure. As indicated in Fig. 3, the resulting
values are quite close, differing by less than 5%.
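
A fit of this kind can be sketched as follows. The explicit exponent of the Gaussian is not legible here, so the sketch ASSUMES, for each real or imaginary component, the form exp[-s^2/(2A) - h^2/(2C) - s h/B]. This parametrization is a guess chosen to match the units and interpretation of A_q, B_q and C_q given in the text, and it may differ from the authors' definition by constant factors. The value of A and the function names are made up; B and C echo the values quoted in the text.

```python
import math
import random


def sample_pair(a, b, c, rng):
    """Draw (s, h) from the ASSUMED density exp[-s^2/(2a) - h^2/(2c) - s*h/b]."""
    # Precision matrix [[1/a, 1/b], [1/b, 1/c]] inverted analytically.
    det = 1.0 / (a * c) - 1.0 / (b * b)  # must be positive, i.e. b*b > a*c
    var_s, var_h, cov = (1.0 / c) / det, (1.0 / a) / det, -(1.0 / b) / det
    z1, z2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
    s = math.sqrt(var_s) * z1
    h = (cov / math.sqrt(var_s)) * z1 + math.sqrt(var_h - cov * cov / var_s) * z2
    return s, h


def fit_parameters(pairs):
    """Estimate (a, b, c) by inverting the sample covariance matrix of (s, h)."""
    n = len(pairs)
    mean_s = sum(s for s, _ in pairs) / n
    mean_h = sum(h for _, h in pairs) / n
    var_s = sum((s - mean_s) ** 2 for s, _ in pairs) / n
    var_h = sum((h - mean_h) ** 2 for _, h in pairs) / n
    cov = sum((s - mean_s) * (h - mean_h) for s, h in pairs) / n
    det = var_s * var_h - cov * cov
    # Precision entries: P_ss = var_h/det, P_sh = -cov/det, P_hh = var_s/det.
    return det / var_h, -det / cov, det / var_s


rng = random.Random(1)  # fixed seed for reproducibility
true_a, true_b, true_c = 0.10, 0.32, 0.42  # a is hypothetical
pairs = [sample_pair(true_a, true_b, true_c, rng) for _ in range(20000)]
a_hat, b_hat, c_hat = fit_parameters(pairs)
print(f"A = {a_hat:.3f}, B = {b_hat:.3f}, C = {c_hat:.3f}")
```

Under this assumed form, A and C are the variances each profile would have if the coupling were switched off (B going to infinity), which matches the interpretation of intrinsic tendencies used in the text.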

We interpret {A_q} and {C_q} as measures of intrinsic
tendencies of hydrophobicity and surface exposure
profiles, while {B_q} indicates the strength of the
interactions that correlate them. In the absence of any
such interactions, {A_q} and {C_q} would be the same as
the power spectra in Fig. 1. With this in mind, let us now
examine these plots in more detail.

The prevalence of α-helices in structures is reflected in
the peak at λ = 3.6 in Fig. 3(a). As a check, we repeated
the analysis for 493 proteins in our database that are
classified as mainly β by CATH [20]. The α-helix peak
disappears completely for this subset, and a weaker peak
corresponding to β strands at λ = 2.2 (which was not
visible in Fig. 3(a)) emerges in its place. This may
indicate that the formation and arrangement of β strands
is less influenced by hydrophobic forces. The other
prominent feature of Fig. 3(a) is the increase in A_q as
q → 0. We believe this reflects the fact that at a coarse
level the protein is a compact polymer; it is well known
that polymer statistics leads to long-range correlations
in the statistics of segments in the interior of a compact
structure [11]. While the precise manner in which this
could lead to correlations as in Fig. 3(a) has not been
worked out, we note that similar effects have been
observed before in studies of protein-like structures in
three dimensions [13], and compact lattice polymers in two
dimensions [12].

The α-helix peak, which is prominent in the hydrophobicity
power spectrum of Fig. 1(b), is absent from Fig. 3(c).
Thus, the observed periodicity in sequence data is not an
intrinsic feature of the amino acid profiles, but is
dictated by the required folding of structures. If the
sequence of amino acids were totally random, we would
expect a distribution P({h_i}) = ∏_i p_a(h_i), where
p_a(h_i) indicates the frequency of a particular amino
acid. The corresponding distribution in Fourier space
would also be independent of q. The observed {C_q} are
indeed constant (approximately 0.42 ± 0.02) at large q.
This constant is different from the average indicated in
Fig. 1(b) under the assumption that the amino acids are
distributed randomly. This difference is due to the
interaction term in Eq. (2).

Reduced values of C_q are observed as q → 0, corresponding
to large periodicities, as seen in Fig. 3(c). A similar
feature is also present in the hydrophobicity power
spectrum in Fig. 1(b), as noted before by Irback et al.
[5], who suggest that anti-correlations can be
advantageous for removing the degeneracies of the ground
state for folding sequences. More recent studies also
indicate that long stretches of hydrophobic monomers,
which could be a source of long-range positive
correlations, are avoided [23]. Further investigations of
this issue would be helpful.

Finally, we note that the interaction terms {B_q} in
Fig. 3(b), which correlate sequence and structure profiles
(at different periodicities), are approximately constant.
As Σ_q h_q s_q* = Σ_i h_i s_i, these terms can be regarded
as arising from the Boltzmann weight exp[-E/(k_B T)] of a
solvation energy E = Σ_i h_i s_i at some temperature T.
Using B_q ≈ 0.32 ± 0.03 kcal/mol, we can extract a
corresponding temperature of T = (2 B_q)/k_B ≈ 323 ± 30 K.
Interestingly, this fictitious T is around room
temperature, i.e. in the range of temperatures at which
most proteins fold and function. This indicates that an
important factor in correlating sequence hydrophobicity
and structural solvent accessibility is indeed the free
energy of solvation. This conclusion is also consistent
with the analysis done by Miller [17], [18], which
estimated differences in the free energies of amino acids
between the surface and the core of the proteins by
counting their relative frequencies in the different
locations.
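
The arithmetic behind the temperature estimate, and the sum rule that links the Fourier-space coupling to a real-space energy, can be checked with a short sketch. The gas constant in kcal/(mol K) is a standard value; the toy profiles and the 1/sqrt(N) normalization of the transform are assumptions made here for illustration.

```python
import cmath
import math

K_B = 1.987204e-3  # gas constant (Boltzmann constant per mole) in kcal/(mol K)


def coupling_temperature(b_kcal_per_mol):
    """Temperature implied by T = 2*B/k_B for a coupling B in kcal/mol."""
    return 2.0 * b_kcal_per_mol / K_B


def unitary_dft(profile):
    """Mean-subtracted Fourier amplitudes with the assumed 1/sqrt(N) scaling."""
    n = len(profile)
    mean = sum(profile) / n
    return [sum((x - mean) * cmath.exp(1j * 2.0 * math.pi * alpha * site / n)
                for site, x in enumerate(profile, start=1)) / math.sqrt(n)
            for alpha in range(n)]


# Made-up toy profiles, used only to check the sum rule numerically.
h = [1.2, -0.4, 0.3, 1.9, -1.1, 0.6]
s = [0.1, 0.8, 0.5, 0.05, 0.9, 0.4]
h_q, s_q = unitary_dft(h), unitary_dft(s)

fourier_sum = sum(a * b.conjugate() for a, b in zip(h_q, s_q))
mean_h, mean_s = sum(h) / len(h), sum(s) / len(s)
site_sum = sum((x - mean_h) * (y - mean_s) for x, y in zip(h, s))

temperature = coupling_temperature(0.32)  # central value of B_q from the text
spread = coupling_temperature(0.03)       # propagated uncertainty of B_q
print(f"sum over q: {fourier_sum.real:.6f}, sum over i: {site_sum:.6f}")
print(f"T = {temperature:.0f} +/- {spread:.0f} K")
```

With these inputs the estimate comes out at roughly 322 K with a spread of about 30 K, close to the value quoted above; the small difference presumably reflects rounding of B_q. The sum rule holds for the mean-subtracted profiles, since the q = 0 component has been removed.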

In principle, the Gaussian distribution in Eq. (2) can be
used as a tool for predicting structures, at least as far
as their surface exposure profile is concerned. Given a
specific sequence, we can calculate the hydrophobicity
profile {h_i}, and the corresponding {h_q}. The
conditional probability for surface exposure profiles is
then given by

P({s_q}|{h_q}) \propto \prod_q \exp[ -|s_q - \chi_q h_q|^2 / (2\sigma_q) ].   (3)

Thus s_q is Gaussian distributed with a mean value of
χ_q h_q, and a variance σ_q, with the 'susceptibility' χ_q
and the 'noise' σ_q easily related to (A_q, B_q, C_q). The
corresponding distribution of {s_i} in real space is then
obtained by Fourier transformation.
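
How the susceptibility and noise follow from the fitted parameters can be sketched under the same ASSUMED per-component form exp[-s^2/(2A) - h^2/(2C) - s h/B] (a guess at the parametrization, not a formula taken from the text). Completing the square in s gives a mean of -(A/B) h and a variance of A; the sketch cross-checks this against the standard conditional formulas for a bivariate Gaussian. The numerical values and function names are hypothetical.

```python
def conditional_from_parameters(a, b):
    """(chi, variance) of s given h for the ASSUMED exponent -s^2/(2a) - h^2/(2c) - s*h/b."""
    return -a / b, a


def conditional_from_covariance(var_s, var_h, cov):
    """Bivariate Gaussian: slope cov/var_h and variance var_s - cov^2/var_h."""
    return cov / var_h, var_s - cov * cov / var_h


def predicted_mean_exposure(hydrophobicity, chi):
    """Mean deviation of s_i from its average when chi does not depend on q."""
    mean_h = sum(hydrophobicity) / len(hydrophobicity)
    return [chi * (h - mean_h) for h in hydrophobicity]


a, b, c = 0.10, 0.32, 0.42  # a is hypothetical; b and c echo values in the text
det = 1.0 / (a * c) - 1.0 / (b * b)
var_s, var_h, cov = (1.0 / c) / det, (1.0 / a) / det, -(1.0 / b) / det

chi_direct, noise_direct = conditional_from_parameters(a, b)
chi_check, noise_check = conditional_from_covariance(var_s, var_h, cov)

toy_h = [1.2, -0.4, 0.3, 1.9, -1.1, 0.6]  # made-up hydrophobicity values
prediction = predicted_mean_exposure(toy_h, chi_direct)
print(f"chi = {chi_direct:.4f}, noise = {noise_direct:.4f}")
```

Because the transform is linear, a q-independent χ acts as a simple rescaling of the mean-subtracted hydrophobicity profile in real space. A q-dependent χ_q, as suggested by the structure in A_q, would instead require multiplying each h_q by χ_q before transforming back.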

FIG. 3: Intrinsic variances of solvent accessibility and hydrophobicity profiles are described by A_q and C_q respectively, while B_q
is related to the interaction that correlates them. The square and circle symbols correspond to the parameters of the imaginary
and real components, respectively. These figures are calculated for our set of 1461 proteins. Dashed lines indicate respectively
the average value of B_q [in (b)], and the asymptotic behavior of C_q [in (c)].

We investigated correlations between protein sequences and
structures due to hydrophobic forces, by application of
Fourier transforms to profiles of hydrophobicity and
solvent accessibility. Each Fourier component is
separately well approximated by a Gaussian distribution;
their joint distribution is described by a product of
multivariate Gaussians at different periodicities. This
approach enables us to separate the intrinsic tendencies
of the profiles from the interactions that couple them. We
thus find that α-helix periodicity is a feature of
structures and not sequences, and that at long periods the
structural profiles are more correlated than average,
while the sequences are less correlated. A quite
satisfying outcome is that the correlations between the
two profiles can be explained by the Boltzmann weight of
the solvation energy at room temperature.

Our joint distribution can be used in applications such as
predicting solvent accessibility from hydrophobicity
profiles [26], or protein interaction sites [27].
Incorporating the impact of correlations within solvent
accessibilities is likely to improve predictions. The
distribution can also be used in analytical approaches to
protein folding, wherever there is a need for taking into
account the complexities of structure and sequence space.